Table of Contents

Project Summary

…explanation of project….

Overview of data splits

Since the random 80% training / 20% testing splits are generated 100 times, we anticipate that if samples are randomly assigned, each one should appear in the training set about 80 of the 100 times (80%) and in the testing set about 20 of the 100 times (20%). The plot below visualizes how many times each sample appears in the training and testing datasets.
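As a minimal sketch of this check (sample IDs and the cohort size of 490 are placeholders, not the real dataset), the expected training count can be verified by tallying appearances over simulated splits:

```python
# Sketch: count how often each sample lands in the training set over
# 100 random 80/20 splits. Cohort size and IDs are hypothetical.
import random
from collections import Counter

random.seed(42)
samples = [f"sample_{i}" for i in range(490)]  # placeholder cohort size
n_splits = 100
n_train = int(0.8 * len(samples))

train_counts = Counter()
for _ in range(n_splits):
    shuffled = random.sample(samples, len(samples))  # random permutation
    train_counts.update(shuffled[:n_train])          # first 80% -> training

test_counts = {s: n_splits - train_counts[s] for s in samples}

# The mean is exactly n_train * n_splits / n_samples by construction (80.0 here);
# individual samples fluctuate around it.
mean_train = sum(train_counts.values()) / len(samples)
print(f"mean training appearances: {mean_train:.1f} / {n_splits}")
```

Individual samples will scatter binomially around 80, which is what the plot is checking for.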

How often do samples appear in the test set together?

For each pair of samples, I quantified how many splits place both samples in the test set together. On average, a pair of samples is in the test set together for 1 split. A majority of sample pairs (75%) are never in the test set together. At the maximum, 2 sample pairs are in the test set together for 14 splits.
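The pair counts can be sketched like this (a small hypothetical cohort stands in for the real data):

```python
# Sketch: for each pair of samples, count the splits that place both
# in the test set together. Cohort and splits are simulated placeholders.
import itertools
import random
from collections import Counter

random.seed(0)
samples = [f"sample_{i}" for i in range(50)]  # small placeholder cohort
n_splits = 100
n_test = int(0.2 * len(samples))

pair_counts = Counter()
for _ in range(n_splits):
    test_set = random.sample(samples, n_test)
    # Sort so each unordered pair maps to one key.
    for pair in itertools.combinations(sorted(test_set), 2):
        pair_counts[pair] += 1

# Pairs absent from the Counter have an implicit co-occurrence count of zero.
n_pairs = len(samples) * (len(samples) - 1) // 2
never_together = n_pairs - len(pair_counts)
mean_count = sum(pair_counts.values()) / n_pairs
```

Each split contributes C(n_test, 2) pairs, so the mean co-occurrence count is fixed by the design; the spread across pairs is what the summary above describes.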

I also split the pairs by diagnosis (both cancer, both normal, or mixed) to make sure there was no difference by dx.

Percent of reads mapped with OptiFit

MCC OptiFit vs OptiClust

HP performance

Model performance during training

Model performance on test data

Averaged ROC Curve

Sensitivity at Specificity Thresholds

Correct Classification Frequency

For each split, a cutoff of 0.5 was used to determine whether each sample was correctly or incorrectly classified. This data was merged for all splits to quantify the percent of splits each sample was correctly classified in. For normal samples it is the frequency that sample was a true negative, and for cancer samples it's the frequency that sample was a true positive. For example, if a cancer sample had a probability of being cancer greater than 0.5 for 10 of the 20 splits it was in the test set for, it would have a percent correct value of 50%. I’m trying to figure out how to visualize this data…
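The per-sample tally can be sketched as follows (the `predictions` records are made up for illustration; the real data would come from the merged split results):

```python
# Sketch: percent of test-set appearances in which each sample is correctly
# classified at a 0.5 probability cutoff. Records are hypothetical
# (sample_id, true_dx, predicted_probability_of_cancer) tuples.
from collections import defaultdict

predictions = [
    ("s1", "cancer", 0.7), ("s1", "cancer", 0.4), ("s1", "cancer", 0.9),
    ("s2", "normal", 0.2), ("s2", "normal", 0.6),
]

correct = defaultdict(int)
total = defaultdict(int)
for sample_id, dx, prob in predictions:
    predicted_cancer = prob > 0.5
    # Cancer samples count true positives; normal samples count true negatives.
    is_correct = predicted_cancer if dx == "cancer" else not predicted_cancer
    correct[sample_id] += is_correct
    total[sample_id] += 1

percent_correct = {s: 100 * correct[s] / total[s] for s in total}
# s1 is a true positive in 2 of 3 splits (~66.7%); s2 in 1 of 2 (50%).
```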

The scatter plot below has the samples along the x-axis without labels because the labels are not readable. The y-axis is the percent correct. The color of the points is the algorithm. There is a separate plot for cancer samples and normal samples.

To eliminate some noise, I removed any samples where there was no difference in the percent correct between the two algorithms.

Prediction Probabilities

I selected the samples where the difference in the correct classification frequency between optifit and opticlust was greater than 0.5 as shown below.
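A sketch of this selection step, with made-up frequencies standing in for the real per-sample values:

```python
# Sketch: keep samples whose correct-classification frequency differs
# between OptiFit and OptiClust by more than 0.5. Values are hypothetical.
percent_correct = {
    # sample_id: (optifit_freq, opticlust_freq), as fractions of splits
    "s1": (0.9, 0.2),
    "s2": (0.5, 0.45),
    "s3": (0.1, 0.8),
}

selected = {
    s for s, (fit, clust) in percent_correct.items()
    if abs(fit - clust) > 0.5
}
```

Here `s1` and `s3` pass the filter; `s2` differs by only 0.05 and is dropped.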

For each of the 104 samples in this set, I plotted the probability of cancer from each split this sample was in the test set for. The dashed line is the 0.5 probability line. Using the 0.5 threshold, a point above the line means the sample would be classified as cancer in that split, and a point below the line means it would be classified as normal. For the cancer samples, a point above the line is a true positive, while for the normal samples a point above the line is a false positive.

For comparison, here is a subset of samples where there is no difference in the percent correctly classified between optifit and opticlust.

Does MCC correlate with sensitivity/specificity?

OptiFit MCC values from just the fit data compared to sensitivity/specificity (threshold = 0.5)
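For reference, MCC, sensitivity, and specificity all derive from the same confusion matrix; a minimal sketch (the counts below are made up for illustration):

```python
# Sketch: MCC, sensitivity, and specificity from one confusion matrix at a
# 0.5 threshold. The tp/tn/fp/fn counts are hypothetical.
import math

tp, tn, fp, fn = 40, 35, 10, 15

sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate

# Matthews correlation coefficient balances all four cells, so it can move
# differently than sensitivity or specificity alone.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)
```

Because MCC uses all four cells while sensitivity and specificity each use only two, the correlation between them is an empirical question, which is what the comparison above examines.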